TIRE-Seq human brain neurosphere clustering

Recap

These are the first experiments where I use version 2 chemistry. Read 2 is now the barcode and UMI. Read 1 is the cDNA. This was done to avoid reading through the TSO on Ilumina read 1

This sequencing run consists of 3 experiments:

  1. Test the version 2 protocol with magnetic poly-T Dynabeads https://au-mynotebook.labarchives.com/share/Piper_Research_Project/MjAuOHwxNzE3NjUvMTYvVHJlZU5vZGUvNDE3OTUxMTA4N3w1Mi44
  2. Benchmark the version 2 protocol with a mixture of human and mouse cell lines Daniel Brown generated https://au-mynotebook.labarchives.com/share/Piper_Research_Project/MjkuOTAwMDAwMDAwMDAwMDAyfDE3MTc2NS8yMy9UcmVlTm9kZS8yMzM4NDg4MTE1fDc1Ljg5OTk5OTk5OTk5OTk5
  3. Test version 2 protocol with a biological appliation from Zac Moore of Brain Cancer lab https://au-mynotebook.labarchives.com/share/Piper_Research_Project/MzEuMjAwMDAwMDAwMDAwMDAzfDE3MTc2NS8yNC9UcmVlTm9kZS8zMzczNDM0NjE1fDc5LjE5OTk5OTk5OTk5OTk5

Check the samples cluster by their cell type

Read SCE and preprocessing

This was generated in notebook 1A_generateSCE.

sce <- readRDS(here::here(
  "data/TIRE_brain_human/SCEs/", "brainCancer_basic.sce.rds"
   ))

Library size normalization and transformation

set.seed(666)

lib.sf <- librarySizeFactors(sce)
sce <- computeSumFactors(sce)
sce <- logNormCounts(sce)

Feature selection

Here we follow the example of https://bioconductor.org/books/release/OSCA/feature-selection.html#variance-of-the-log-counts

dec.sce <- modelGeneVar(sce)
fit.sce <- metadata(dec.sce)
hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
sce <- scater::runPCA(sce, subset_row=hvg.sce.var, ncomponents=10)

Visulaise the fit. According to OSCA book:

At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component.

Therefore the efitted line represents technical variation of which is much higher in version 1 where I used the read object compared to version 2 where the UMI object is used.

We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection.

Build the object

tb <- as_tibble(cbind(fit.sce$mean, fit.sce$var))

colnames(tb) <- c("Mean", "Variance")

Make the plot of technical and biological variation.

plt1 <- ggplot(tb, 
             aes(x = Mean, y= Variance)) + 
  geom_point(alpha = 0.2, size=0.5) + 
  guides(colour = guide_legend(override.aes = list(size=2, alpha=1))) +
  xlab("Mean of log-expression") + 
  ylab("Variance of log-expression") +
  scale_colour_brewer(type="qualitative", palette = "Dark2") +
  geom_function(fun = fit.sce$trend, colour = "darkgreen") +
  theme_Publication(base_size = 16)

plt1
The coloured line represents technical variation

The coloured line represents technical variation

PCA plots all samples

This includes my samples too. Separation is very clearly human and mouse.

plt1 <- plotPCA(sce, colour_by="Cell_Line") + theme_Publication()

plt1

Filter for Brain cancer and remove low quality samples

This distort the PCA and should be removed anyway when I do DE

tb <- as_tibble(colData(sce))

plt1 <- ggplot(tb, 
             aes(x = sum, y= detected, colour=Storage)) + 
  geom_point(size=1.5) + 
  xlab("UMI counts") + 
  ylab("Genes detected") +
  scale_x_continuous(trans='log10') +
  annotation_logticks(base = 10, sides = "b") +
  ylim(0,20000) +
  scale_colour_brewer(type="qualitative", palette = "Dark2") +
  theme_Publication(base_size = 16)

Remove the low quality samples. Focus on the brain cancer samples that were stored frozen.

sce <- sce[,sce$Researcher == "ZM"]
sce <- sce[,sce$Storage == "Freezer"]
sce <- sce[,sce$sum > 60000]

Plot library size again after removing low quality samples.

tb <- as_tibble(colData(sce))

plt2 <- ggplot(tb, 
             aes(x = sum, y= detected, colour=Storage)) + 
  geom_point(size=1.5) + 
  xlab("UMI counts") + 
  ylab("Genes detected") +
  scale_x_continuous(trans='log10') +
  annotation_logticks(base = 10, sides = "b") +
  ylim(0,20000) +
  scale_colour_brewer(type="qualitative", palette = "Dark2") +
  theme_Publication(base_size = 16)

plt1 / plt2

Library size normalization and transformation

set.seed(666)
lib.sf <- librarySizeFactors(sce)
sce <- computeSumFactors(sce)
sce <- logNormCounts(sce)

Feature selection

There is a lot more variation than the example at https://bioconductor.org/books/release/OSCA/feature-selection.html#variance-of-the-log-counts

Choose 1000 genes instead of standard 1000 because of the presence of human and mouse genes.

dec.sce <- modelGeneVar(sce)
fit.sce <- metadata(dec.sce)
hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
sce <- runPCA(sce, subset_row=hvg.sce.var, ncomponents=10)

Visulaise the fit. According to OSCA book:

At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component.

Therefore the efitted line represents technical variation of which is much higher in version 1 where I used the read object compared to version 2 where the UMI object is used.

We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection.

Build the object

tb <- as_tibble(cbind(fit.sce$mean, fit.sce$var))
colnames(tb) <- c("Mean", "Variance")

Plot mean variance of high quality glioma samples

plt1 <- ggplot(tb, 
             aes(x = Mean, y= Variance)) + 
  geom_point(alpha = 0.2, size=0.5) + 
  guides(colour = guide_legend(override.aes = list(size=2, alpha=1))) +
  xlab("Mean of log-expression") + 
  ylab("Variance of log-expression") +
  scale_colour_brewer(type="qualitative", palette = "Dark2") +
  geom_function(fun = fit.sce$trend, colour = "darkgreen") +
  theme_Publication(base_size = 16)

plt1
The coloured line represents technical variation

The coloured line represents technical variation

PCA plots high quality glioma samples

Convert some drug metadata to factors

sce$Dose_M <- as.numeric(sce$Dose_M)
sce$Day_Exposure <- as.factor(sce$Day_Exposure)

# Reorder the levels
sce$Drug <- factor(sce$Drug, levels = c("TMZ", "DMSO"))
print(sce$Drug)
##  [1] TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  DMSO TMZ  TMZ  TMZ  TMZ  DMSO TMZ  TMZ 
## [16] TMZ  TMZ  TMZ  DMSO TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  DMSO TMZ  TMZ  TMZ 
## [31] TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ  TMZ 
## [46] TMZ  TMZ  DMSO DMSO TMZ  TMZ  TMZ  TMZ  DMSO TMZ  TMZ  TMZ  TMZ  TMZ  TMZ 
## [61] DMSO TMZ  TMZ  TMZ  TMZ 
## Levels: TMZ DMSO

The dose is related to the number of genes detected

plt1 <- plotPCA(sce, colour_by="Drug") + 
  theme_Publication()

plt2 <- plotPCA(sce, colour_by="Dose_M", shape_by="Day_Exposure", point_size=2.5) +
  theme_Publication()

plt3 <- plotPCA(sce, colour_by="Day_Exposure", shape_by="Drug", point_size=2.5) + 
  theme_Publication()

plt4 <- plotPCA(sce, colour_by="detected") + 
  theme_Publication()

View timepoint and Drug. Set the Drug as a a letter.

pca_tb <- as_tibble(reducedDim(sce))
pca_tb$Day_Exposure <- sce$Day_Exposure
pca_tb$Drug <- substr(x = sce$Drug, start = 1, stop = 1)

ggplot(pca_tb, aes(x=PC1, y=PC2, colour=Day_Exposure, label=Drug)) +
  geom_text(size=4.5) +
  xlab("PC1 (35%)") + ylab("PC2 (5%)") +
  scale_colour_brewer(palette = "Dark2", name = "Days") +
  theme_Publication()

View library size and Drug. Set the Drug as a a shape.

Some relationship with library size but I don’t think it explains everything

pca_tb <- as_tibble(reducedDim(sce))
pca_tb$sum <- log(sce$sum+1)
pca_tb$Drug <- sce$Drug

ggplot(pca_tb, aes(x=PC1, y=PC2, colour=sum, shape=Drug)) +
  geom_point(size=3) +
  xlab("PC1 (35%)") + ylab("PC2 (5%)") +
  theme_Publication()

Conventional visualisation

plotPCA(sce, colour_by="Day_Exposure", shape_by="Drug", point_size=2.5) + 
  theme_Publication()

View dose and timepoint

plt2

Save subset SCE

saveRDS(sce, here::here(
    "data/TIRE_brain_human/SCEs/", "brainCancer_subset.sce.rds"
   ))

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 9.4 (Plow)
## 
## Matrix products: default
## BLAS:   /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRblas.so 
## LAPACK: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRlapack.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggthemes_5.1.0              here_1.0.1                 
##  [3] knitr_1.48                  patchwork_1.3.0            
##  [5] scater_1.32.1               scran_1.32.0               
##  [7] scuttle_1.14.0              lubridate_1.9.3            
##  [9] forcats_1.0.0               stringr_1.5.1              
## [11] dplyr_1.1.4                 purrr_1.0.2                
## [13] readr_2.1.5                 tidyr_1.3.1                
## [15] tibble_3.2.1                ggplot2_3.5.1              
## [17] tidyverse_2.0.0             SingleCellExperiment_1.26.0
## [19] SummarizedExperiment_1.34.0 Biobase_2.64.0             
## [21] GenomicRanges_1.56.2        GenomeInfoDb_1.40.1        
## [23] IRanges_2.38.1              S4Vectors_0.42.1           
## [25] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
## [27] matrixStats_1.4.1          
## 
## loaded via a namespace (and not attached):
##  [1] gridExtra_2.3             rlang_1.1.4              
##  [3] magrittr_2.0.3            compiler_4.4.1           
##  [5] DelayedMatrixStats_1.26.0 vctrs_0.6.5              
##  [7] pkgconfig_2.0.3           crayon_1.5.3             
##  [9] fastmap_1.2.0             XVector_0.44.0           
## [11] labeling_0.4.3            utf8_1.2.4               
## [13] rmarkdown_2.28            tzdb_0.4.0               
## [15] UCSC.utils_1.0.0          ggbeeswarm_0.7.2         
## [17] xfun_0.48                 bluster_1.14.0           
## [19] zlibbioc_1.50.0           cachem_1.1.0             
## [21] beachmat_2.20.0           jsonlite_1.8.9           
## [23] highr_0.11                DelayedArray_0.30.1      
## [25] BiocParallel_1.38.0       irlba_2.3.5.1            
## [27] parallel_4.4.1            cluster_2.1.6            
## [29] R6_2.5.1                  RColorBrewer_1.1-3       
## [31] bslib_0.8.0               stringi_1.8.4            
## [33] limma_3.60.6              jquerylib_0.1.4          
## [35] Rcpp_1.0.13               Matrix_1.7-0             
## [37] igraph_2.0.3              timechange_0.3.0         
## [39] tidyselect_1.2.1          rstudioapi_0.17.0        
## [41] abind_1.4-8               yaml_2.3.10              
## [43] viridis_0.6.5             codetools_0.2-20         
## [45] lattice_0.22-6            withr_3.0.1              
## [47] evaluate_1.0.1            pillar_1.9.0             
## [49] generics_0.1.3            rprojroot_2.0.4          
## [51] hms_1.1.3                 sparseMatrixStats_1.16.0 
## [53] munsell_0.5.1             scales_1.3.0             
## [55] glue_1.8.0                metapod_1.12.0           
## [57] tools_4.4.1               BiocNeighbors_1.22.0     
## [59] ScaledMatrix_1.12.0       locfit_1.5-9.10          
## [61] cowplot_1.1.3             edgeR_4.2.2              
## [63] colorspace_2.1-1          GenomeInfoDbData_1.2.12  
## [65] beeswarm_0.4.0            BiocSingular_1.20.0      
## [67] vipor_0.4.7               cli_3.6.3                
## [69] rsvd_1.0.5                fansi_1.0.6              
## [71] S4Arrays_1.4.1            viridisLite_0.4.2        
## [73] gtable_0.3.5              sass_0.4.9               
## [75] digest_0.6.37             SparseArray_1.4.8        
## [77] ggrepel_0.9.6             dqrng_0.4.1              
## [79] farver_2.1.2              htmltools_0.5.8.1        
## [81] lifecycle_1.0.4           httr_1.4.7               
## [83] statmod_1.5.0